When traveling around the United States, an increasing number of people are turning to the popular service Airbnb as an alternative to more traditional options like hotels. The main reasons for this trend include flexibility in location and housing type, as well as affordability.
It's important for the modern traveler to understand what determines the price of an Airbnb listing. A good question to ask might be "Does the minimum-night limit of a listing predict the cost?" Or maybe "What are the characteristics that affordable listings have in common?"
The following dataset comes from Kaggle and comprises nearly a quarter of a million Airbnb listings around the United States in 2020. The primary question I plan to ask of this dataset is: "What are the best predictors of the price of an Airbnb in the U.S.?"
Let's take a look at the data before determining a hypothesis or model.
%matplotlib inline
import warnings
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
import eli5
from pandas_profiling import ProfileReport
from category_encoders.ordinal import OrdinalEncoder
from xgboost import XGBRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.feature_selection import SelectKBest, f_regression, chi2
from sklearn.inspection import permutation_importance
from sklearn.pipeline import make_pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.metrics import r2_score, mean_squared_error, roc_auc_score
# Ignore ugly warning messages
warnings.filterwarnings('ignore')
data_url = './data/ab_us_2020.csv'
df = pd.read_csv(data_url)
print(f"Dataset: {df.shape}")
df.head()
ProfileReport(df)
df.describe()
# Data wrangling function
def wrangle(df):
    df = df.copy()
    # Drop irrelevant features
    df = df.drop(
        columns=['id', 'name', 'host_name', 'neighbourhood_group']
    )
    # Convert last_review to datetime,
    # and replace it with only the month
    df['last_review'] = pd.to_datetime(
        df['last_review'], infer_datetime_format=True
    )
    df['last_review'] = df['last_review'].dt.month
    # Split data into feature matrix and target vector
    target = 'price'
    X = df.drop(target, axis=1)
    y = df[target]
    # Build a preprocessing pipeline to prep features for modeling
    cat_features = [col for col in X.columns if X[col].dtype == 'object']
    num_features = [col for col in X.columns if col not in cat_features]
    cat_pipe = make_pipeline(
        SimpleImputer(strategy='constant', fill_value='missing'),
        OrdinalEncoder(handle_unknown='return_nan')
    )
    num_pipe = make_pipeline(
        SimpleImputer(strategy='median'), StandardScaler()
    )
    preprocessor = ColumnTransformer(transformers=[
        ('category transformer', cat_pipe, cat_features),
        ('numeric transformer', num_pipe, num_features)
    ])
    # Transform the features
    X = preprocessor.fit_transform(X)
    # Split the data into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2
    )
    # Return the transformed feature matrices and target vectors
    return X_train, X_test, y_train, y_test
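One thing to be aware of: the `wrangle` function above fits the imputers, encoder, and scaler on every row before the split, so the test rows influence the medians and scaling that the training data sees. A minimal sketch of the alternative, splitting first and fitting the preprocessor on the training rows only, might look like the following. The toy frame and its values are made up for illustration, `random_state=42` is an arbitrary choice, and scikit-learn's own `OrdinalEncoder` is swapped in for the `category_encoders` one so that categories unseen in training map to -1 instead of NaN.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# Hypothetical stand-in for the Airbnb listings (values are invented)
toy = pd.DataFrame({
    "room_type": ["Entire home/apt", "Private room", "Private room",
                  "Shared room", "Entire home/apt", "Private room",
                  "Entire home/apt", "Private room", "Shared room",
                  "Entire home/apt"],
    "minimum_nights": [1, 2, np.nan, 30, 3, 1, 2, 7, 1, 4],
    "price": [150, 60, 75, 40, 200, 55, 180, 70, 35, 160],
})
X = toy.drop(columns="price")
y = toy["price"]

# Split the raw rows first, so the test set stays unseen
X_train_raw, X_test_raw, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

cat_pipe = make_pipeline(
    SimpleImputer(strategy="constant", fill_value="missing"),
    # Categories that never appear in training are encoded as -1
    OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1),
)
num_pipe = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())
preprocessor = ColumnTransformer(transformers=[
    ("cat", cat_pipe, ["room_type"]),
    ("num", num_pipe, ["minimum_nights"]),
])

# Learn medians, category codes, and scaling from the training rows only,
# then apply those same learned values to the test rows
X_train = preprocessor.fit_transform(X_train_raw)
X_test = preprocessor.transform(X_test_raw)

print(X_train.shape, X_test.shape)
```

On a dataset this large the difference in scores is likely to be small, but the split-first order keeps the test score an honest estimate of performance on listings the model has never seen.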
# Drop outlier (minimum_nights = 100000000000000)
df.drop(df['minimum_nights'].idxmax(), inplace=True)
print(f"Dataset: {df.shape}")
# Wrangle the data
X_train, X_test, y_train, y_test = wrangle(df)
print(f"""
Training set: {len(X_train) / len(df) *100}%
Testing set: {len(X_test) / len(df) *100}%
""")
My goal is to build a model in Python that predicts the price of an Airbnb listing as accurately as possible. This is a regression problem, meaning the variable I'm targeting (price) can take any value on a continuous range rather than one of a few fixed categories. First, I'll fit a standard linear regression model from Scikit-Learn.
A linear regression model uses all the algebraic mind gymnastics we learned in high school, like slope and y-intercept, to fit a line to the input data (the features of the Airbnb listings). It then applies the equation of that line to predict the price of future listings.
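To make the slope and y-intercept idea concrete, here is a minimal sketch of fitting a line to a single feature by ordinary least squares, in plain Python. The `bedrooms` and `prices` values are invented for illustration and are not taken from the dataset; Scikit-Learn's `LinearRegression` solves the same least-squares problem, just with many features at once.

```python
# Hypothetical data: one feature and a price for five listings
bedrooms = [1, 2, 3, 4, 5]
prices = [80, 130, 170, 230, 270]


def fit_line(x, y):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # slope = how x and y vary together, divided by how much x varies
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    slope = sxy / sxx
    # The fitted line always passes through the point (x_mean, y_mean)
    intercept = y_mean - slope * x_mean
    return slope, intercept


slope, intercept = fit_line(bedrooms, prices)


def predict(x_new):
    """Apply the fitted line y = slope * x + intercept to a new value."""
    return slope * x_new + intercept


print(f"price = {slope:.1f} * bedrooms + {intercept:.1f}")
print(f"Predicted price for 6 bedrooms: {predict(6):.1f}")
```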
I want to test this model using two scoring metrics:
R-squared score (R2): Measures the proportion of the variance in the actual prices that the model explains. A score of 1.0 is a perfect fit, and a score of 0 is no better than always predicting the average price.
Mean-squared error (MSE): Measures the average of the squared differences between the predicted and actual values. Lower is better.
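Both metrics are short enough to compute by hand. The sketch below does so in plain Python on invented `actual` and `predicted` prices; the function names are my own, and the formulas are the standard ones that Scikit-Learn's `mean_squared_error` and `r2_score` use.

```python
def mse_manual(y_true, y_pred):
    """Average of the squared differences between actual and predicted."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_manual(y_true, y_pred):
    """1 minus (residual sum of squares / total sum of squares)."""
    y_mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical prices: every prediction is off by 10 dollars
actual = [100, 150, 200, 250]
predicted = [110, 140, 210, 240]

mse = mse_manual(actual, predicted)
r2 = r2_manual(actual, predicted)
print(f"MSE: {mse:.1f}")
print(f"R2:  {r2:.3f}")
```

Because MSE is in squared dollars, taking its square root gives an error in the same units as the price, which is often easier to interpret.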
Let's fit and test a linear regression model: